In this notebook, we will introdcue a few R packages that can be handy to manipulate, summarize and plot the data coming from the PinAPL-py pipeline. Importantly, the methodologies presented here can be applied and expanded to any reasonnably sized dataset.
The packges used are loaded using the following set of commands. This assumes the packages have been installed using the install.package command or the Rstudio Tools>Install Packages menu.
library(dplyr) # data wrangling and dataframe manipulation
library(tidyr) # data wrangling and dataframe manipulation
library(reshape2) # conversion between long and wide dataframe formats
library(ineq) # calculating gini coefficient (and others)
library(ggplot2) # verstile plotting package
library(pheatmap) # clustering and plotting pretty heatmaps
In order to best benefit from this practise and use case, one needs to be familiar with R data structure and manipulation, mostly differentating vector, matrix and dataframes. A primer on the different data types (numeric, integer, character, etc) is also useful.
The data used in this workshop corresponds to a GECKO screen in a cancer cell line, looking for enrichment of shRNA in cells treated with a genotoxic drug for 1h. 4 libraries from each Baseline (T0), Treated (T3_T) and Untreated (T3_U) were sequenced. In the first part we will study the count files, in a second section the significance analysis.
Data Import and Formatting
The pipeline places the count data in different folders. While there is a way to have R crawl through mulitple folders and import the data, we will assume here, for convenience, that the counts files have been placed in the same “counts” folder and renamed after the library ID. We will import the non-normalized counts.
FiRst we write a function that takes the file name (and path), import the content into a dataframe, names the columns, and add a column containing the name of the file.
importCounts<-function(x){
f<-read.delim(x,header=FALSE)
colnames(f)<-c("sgRNA","gene", "value")
f$Library<-x
return(f)
}
Now we apply this function to a list of files in the folder.
files <- list.files(path="./counts",pattern=".tsv",full.names = TRUE)
CountSg<-sapply(files,importCounts,simplify=FALSE)
We get a list of dataframes, which we collapse into one giant dataframe to which we remove the rownames, since they are not relevant.
CountSg<-do.call(rbind,CountSg)
rownames(CountSg)<-NULL
We can now edit the dataframe and separate the library fields into subfields using the function tidyr::separate function
CountSg<-separate(CountSg, Library, sep="[./]",into=c("pre1","pre2","folder","Library","ext1","ext2"),remove=TRUE)
and we keep only the column we care about using dplyr::select
CountSg<-select(CountSg,sgRNA,gene,value,Library)
We have now a large dataframe containing all the counts data
Summary Statistics and Normalization
Here we will use mostly functions from the dplyr package. for example to determine the total number of counts in each library
totalCounts<-CountSg %>% group_by(Library) %>% summarize(Total=sum(value))
totalCounts
We can therefore normalized the counts to the total number of counts in each library and multiply by 1E6 to get an RPM (read per million) unit. For this we can use the function dplyr::mutate, which can assign tghe results of an operation to a new column. But before we specigy the grouping variable.
CountSg<-CountSg %>% group_by(Library) %>% mutate(RPM=value*1E6/sum(value))
We can now verify that all the libraries have the same sum of RPM and can be compared
CountSg %>% group_by(Library) %>% summarize(Total=sum(RPM))
plotting cumulative distribution of counts
Let’s now look at the distribution of RPM in each library. For this, we can use the ggplot2 package to plot the cumulative distribution function. Which is similar to the Lorenz plot generated by PinAPL-Py. Let’s start with one library.
let’s start with one library using the dplyr::filter command
C1<-filter(CountSg,Library=="Control_1")
A typical ggplot command starts with the ggplot fucntion, which specifies the dataframe and the variable to use in the “aesthetic” (x, y, colors, sizes, etc, etc). This function is them modified by adding the type of plot desired, here stat_ecdf(). The RPM values are following an exponential distribution (lots of low covered sgRNA), we will therefore plot their log10. In order to also represent the sgRNA that are not covered in the library (RPM=0), we will add a small number to all RPM. Finally, we specific the size of the plot (in inches) in the header.
ggplot(C1,aes(log10(RPM+0.01)))+stat_ecdf()

We can now change a bit the labels and the size of the font to make it prettier
ggplot(C1,aes(log10(RPM+0.01)))+stat_ecdf()+
ylab("Fraction of sgRNA")+
xlab("log10(RPM)")+
theme(text=element_text(size=20))

and now we can plot all libraies on the same plot, by starting from the full datafrane and adding a color variable to the aesthetic, (adn adjusting the)and adjusting panel wiodth to make room for the legend)
ggplot(CountSg,aes(log10(RPM+0.01),col=Library))+stat_ecdf()+
ylab("Fraction of sgRNA")+
xlab("log10(RPM)")+
theme(text=element_text(size=20))

Now, this is too many libraries to visualize on one plot, we shoudl be able to separate them by timepoint and condition. We need to add another field to the dataframe. For this we will use the function dplyr::mutate seen before and perfrom a logical test based on the name of the library using grepl. grepl returns TRUE if the queries string is present in the tested string, FALSE otherwise. The function ifelse will assign the type to the type variable, as a result of this test.
CountSg<-mutate(CountSg,type=ifelse(grepl("T0",Library),"Baseline","T3"))
CountSg<-mutate(CountSg,type=ifelse(grepl("T3_T",Library),"treated_T3",type))
CountSg<-mutate(CountSg,type=ifelse(grepl("T3_U",Library),"untreated_T3",type))
we can verify the new field, by listing the unique columns
CountSg %>% select(Library,type) %>% unique()
and we can now use the fucntion facet_wrap in ggplot to separate each type on a separate panel, increaseing the width (resp heigth) to make room if we display them in a row (resp. column)
ggplot(CountSg,aes(log10(RPM+0.01),col=Library))+stat_ecdf()+
ylab("Fraction of sgRNA")+
xlab("log10(RPM)")+
theme(text=element_text(size=20))+
facet_wrap(~type,ncol=3)

We can clearly see that some selection happened in the treated sample. Let’s calculte the gini coefficient for each library. this is provided by the function ineq::ineq
gini<-CountSg %>% group_by(Library,type) %>% summarize(gini=ineq(RPM,type="Gini"))
gini
Let’s plot the corresponding bar graph using the geom_bar function. Beause the value to plot is directly given by the gini field and that no math or calculations are needed, we need to specific stat=“identity”
ggplot(gini,aes(Library,gini,fill=type))+geom_bar(stat="identity")+
ylab("gini")+
xlab("Libraries")+
theme(text=element_text(size=14))

Now this is not pretty, we can rotate the x axis labels in the theme variable. But before we need to change the levels of the Name and type variable so that they are not plotted in alphabetical order.
gini$Library<-factor(gini$Library,levels=c("I1R1T0_1","I1R2T0_1","I2R1T0_1","I2R2T0_1","I1R1T3_U","I1R2T3_U","I2R1T3_U","I2R2T3_U","I1R1T3_T","I1R2T3_T","I2R1T3_T","I2R2T3_T"))
Now we can replot, rotatign the x axiss
ggplot(gini,aes(Library,gini,fill=type))+geom_bar(stat="identity")+
ylab("gini")+
xlab("Libraries")+
theme(text=element_text(size=14),axis.text.x=element_text(angle=45,hjust=1))

Multidimentional plotting
Principal component
Selection has happened on all treated samples, but it is not clear wether the same sgRNAs are being selected. We can investigate the similarity in RPM profile by calculating the fisrt two principal components. For this we first need to generate a matrix of sgRNA x Library using the function reshape2:dcast
CountSgMat<-dcast(data=CountSg,sgRNA~Library,value.var="RPM")
head(CountSgMat)
We can now convert this dataframe into a matrix, by assinging sgRNA field to rownames.
rownames(CountSgMat)<-CountSgMat$sgRNA
CountSgMat<-select(CountSgMat,-sgRNA)
CountSgMat<-as.matrix(CountSgMat)
now I can calcualte the principal components
pca<-prcomp(CountSgMat)
This is a list of two matrices, one for the standard deviation of each PC (pca\(sdev) and one with the values of each PC (pca\)rotation).
In order to use ggplot to plot the PC, I need to convert into a dataframe and add the name of each library as a new field to the pca
pca<-as.data.frame(pca$rotation)
pca<-mutate(pca,Library=rownames(pca))
I can now plot using ggplot
ggplot(pca,aes(PC1,PC2,col=Library))+geom_point()

Now let’s make it prettier
pca$Library<-factor(pca$Library,levels=c("I1R1T0_1","I1R2T0_1","I2R1T0_1","I2R2T0_1","I1R1T3_U","I1R2T3_U","I2R1T3_U","I2R2T3_U","I1R1T3_T","I1R2T3_T","I2R1T3_T","I2R2T3_T"))
ggplot(pca,aes(PC1,PC2,col=Library))+geom_point(size=3, alpha=0.5)+
theme(text=element_text(size=14))

NA
We clearly see that all Baseline are similar, all untreated are similar, but each treated is very different, with a very strong outlier. Can I use custome colors and add some labels?
ggplot(pca,aes(PC1,PC2,col=Library))+geom_point(size=3, alpha=0.5)+
theme(text=element_text(size=14))+
scale_color_manual(values=c(rep("grey",4),rep("steelblue",4),rep("tomato",4)))

NA
clustering and heatmap
We can use the same XX
---
title: "CRISPR screening - R practice"
output: html_notebook
---

In this notebook, we will introdcue a few R packages that can be handy to manipulate, summarize and plot the data coming from the PinAPL-py pipeline. Importantly, the methodologies presented here can be applied and expanded to any reasonnably sized dataset. 

The packges used are loaded using the following set of commands. This assumes the packages have been installed using the install.package command or the Rstudio Tools>Install Packages menu. 

```{r}
library(dplyr) # data wrangling and dataframe manipulation
library(tidyr) # data wrangling and dataframe manipulation
library(reshape2) # conversion between long and wide dataframe formats
library(ineq) # calculating gini coefficient (and others)
library(ggplot2) # verstile plotting package
library(pheatmap) # clustering and plotting pretty heatmaps
```

In order to best benefit from this practise and use case, one needs to be familiar with R data structure and manipulation, mostly differentating vector, matrix and dataframes. A primer on the different data types (numeric, integer, character, etc) is also useful. 

The data used in this workshop corresponds to a GECKO screen in a cancer cell line, looking for enrichment of shRNA in cells treated with a genotoxic drug  for 1h. 4 libraries from each Baseline (T0), Treated (T3_T) and Untreated (T3_U) were sequenced. In the first part we will study the count files, in a second section the significance analysis. 

# Data Import and Formatting

The pipeline places the count data in different folders. While there is a way to have R crawl through mulitple folders and import the data, we will assume here, for convenience, that the counts files have been placed in the same "counts" folder and renamed after the library ID. We will import the non-normalized counts. 

FiRst we write a function that takes the file name (and path), import the content into a dataframe, names the columns, and add a column containing the name of the file. 

```{r}
importCounts<-function(x){
  f<-read.delim(x,header=FALSE)
  colnames(f)<-c("sgRNA","gene", "value")
  f$Library<-x
  return(f)
}
```
Now we apply this function to a list of files in the folder.

```{r}
files <- list.files(path="./counts",pattern=".tsv",full.names = TRUE)
CountSg<-sapply(files,importCounts,simplify=FALSE)

```
We get a list of dataframes, which we collapse into one giant dataframe to which we remove the rownames, since they are not relevant.
 
 
```{r}
CountSg<-do.call(rbind,CountSg)
rownames(CountSg)<-NULL
```

We can now edit the dataframe and separate the library fields into subfields using the function tidyr::separate function


```{r}
CountSg<-separate(CountSg, Library, sep="[./]",into=c("pre1","pre2","folder","Library","ext1","ext2"),remove=TRUE)

```
and we keep only the column we care about using dplyr::select
```{r}
CountSg<-select(CountSg,sgRNA,gene,value,Library)
```

We have now a large dataframe containing all the counts data

# Summary Statistics and Normalization

Here we will use mostly functions from the dplyr package. for example to determine the total number of counts in each library

```{r}
totalCounts<-CountSg %>% group_by(Library) %>% summarize(Total=sum(value))
totalCounts
```

We can therefore normalized the counts to the total number of counts in each library and multiply by 1E6 to get an RPM (read per million) unit. For this we can use the function dplyr::mutate, which can assign tghe results of an operation to a new column. But before we specigy the grouping variable. 


```{r}
CountSg<-CountSg %>% group_by(Library) %>% mutate(RPM=value*1E6/sum(value))
```

We can now verify that all the libraries have the same sum of RPM and can be compared

```{r}
CountSg %>% group_by(Library) %>% summarize(Total=sum(RPM))
```


# plotting cumulative distribution of counts

Let's now look at the distribution of RPM in each library. For this, we can use the ggplot2 package to plot the cumulative distribution function. Which is similar to the Lorenz plot generated by PinAPL-Py. Let's start with one library.


let's start with one library using the dplyr::filter command

```{r}
C1<-filter(CountSg,Library=="Control_1")
```

A typical ggplot command starts with the `ggplot` fucntion, which specifies the dataframe and the variable to use in the "aesthetic" (x, y, colors, sizes, etc, etc). This function is them modified by adding the type of plot desired, here stat_ecdf(). The RPM values are following an exponential distribution (lots of low covered sgRNA), we will therefore plot their log10. In order to also represent the sgRNA that are not covered in the library (RPM=0), we will add a small number to all RPM. Finally, we specific the size of the plot (in inches) in the header. 

```{r fig.width=4, fig.height=4}
ggplot(C1,aes(log10(RPM+0.01)))+stat_ecdf()
```

We can now change a bit the labels and the size of the font to make it prettier

```{r fig.width=4, fig.height=4}
ggplot(C1,aes(log10(RPM+0.01)))+stat_ecdf()+
  ylab("Fraction of sgRNA")+
  xlab("log10(RPM)")+
  theme(text=element_text(size=20))
```
 and now we can plot all libraies on the same plot, by starting from the full datafrane and adding a color variable to the aesthetic, (adn adjusting the)and adjusting panel wiodth to make room for the legend)
 
```{r fig.width=5, fig.height=4}
ggplot(CountSg,aes(log10(RPM+0.01),col=Library))+stat_ecdf()+
  ylab("Fraction of sgRNA")+
  xlab("log10(RPM)")+
  theme(text=element_text(size=20))
```

Now, this is too many libraries to visualize on one plot, we shoudl be able to separate them by timepoint and condition. We need to add another field to the dataframe. For this we will use the function dplyr::mutate seen before and perfrom a logical test based on the name of the library using grepl. grepl returns TRUE if the queries string is present in the tested string, FALSE otherwise. The function `ifelse` will assign the type to the type variable, as a result of this test. 

```{r}
CountSg<-mutate(CountSg,type=ifelse(grepl("T0",Library),"Baseline","T3"))
CountSg<-mutate(CountSg,type=ifelse(grepl("T3_T",Library),"treated_T3",type))
CountSg<-mutate(CountSg,type=ifelse(grepl("T3_U",Library),"untreated_T3",type))
```
we can verify the new field, by listing the unique columns
```{r}
CountSg %>% select(Library,type) %>% unique()
```
 and we can now use the fucntion facet_wrap in ggplot to separate each type on a separate panel, increaseing the width (resp heigth) to make room if we display them in a row (resp. column)
 
 
```{r fig.width=8, fig.height=4}
ggplot(CountSg,aes(log10(RPM+0.01),col=Library))+stat_ecdf()+
  ylab("Fraction of sgRNA")+
  xlab("log10(RPM)")+
  theme(text=element_text(size=20))+
  facet_wrap(~type,ncol=3)
```

We can clearly see that some selection happened in the treated sample. Let's calculte the gini coefficient for each library. this is provided by the function ineq::ineq

```{r}
gini<-CountSg %>% group_by(Library,type) %>% summarize(gini=ineq(RPM,type="Gini"))
gini
```

Let's plot the corresponding bar graph using the geom_bar function. Beause the value to plot is directly given by the gini field and that no math or calculations are needed, we need to specific stat="identity"

```{r fig.width=5, fig.height=4}
ggplot(gini,aes(Library,gini,fill=type))+geom_bar(stat="identity")+
 ylab("gini")+
  xlab("Libraries")+
  theme(text=element_text(size=14))
```

Now this is not pretty, we can rotate the x axis labels in the theme variable. But before we need to change the levels of the Name and type variable so that they are not plotted in alphabetical order. 

```{r}
gini$Library<-factor(gini$Library,levels=c("I1R1T0_1","I1R2T0_1","I2R1T0_1","I2R2T0_1","I1R1T3_U","I1R2T3_U","I2R1T3_U","I2R2T3_U","I1R1T3_T","I1R2T3_T","I2R1T3_T","I2R2T3_T"))
```
Now we can replot, rotatign the x axiss

```{r fig.width=5, fig.height=4}
ggplot(gini,aes(Library,gini,fill=type))+geom_bar(stat="identity")+
 ylab("gini")+
  xlab("Libraries")+
  theme(text=element_text(size=14),axis.text.x=element_text(angle=45,hjust=1))
```

# Multidimentional plotting

## Principal component

Selection has happened on all treated samples, but it is not clear wether the same sgRNAs are being selected. We can investigate the similarity in RPM profile by calculating the fisrt two principal components. For this we first need to generate a matrix of sgRNA x Library using the function reshape2:dcast

```{r}
CountSgMat<-dcast(data=CountSg,sgRNA~Library,value.var="RPM")
head(CountSgMat)
```
We can now convert this dataframe into a matrix, by assinging sgRNA field to rownames. 
```{r}
rownames(CountSgMat)<-CountSgMat$sgRNA
CountSgMat<-select(CountSgMat,-sgRNA)
CountSgMat<-as.matrix(CountSgMat)
```



now I can calcualte the principal components

```{r}
pca<-prcomp(CountSgMat)
```
This is a list of two matrices, one for the standard deviation of each PC (pca$sdev) and one with the values of each PC (pca$rotation). 

In order to use ggplot to plot the PC, I need to convert into a dataframe and add the name of each library as a new field to the pca

```{r}
pca<-as.data.frame(pca$rotation)
pca<-mutate(pca,Library=rownames(pca))
```
I can now plot using ggplot 

```{r fig.width=5, fig.height=4}
ggplot(pca,aes(PC1,PC2,col=Library))+geom_point()
```

Now let's make it prettier

```{r fig.width=5, fig.height=4}
pca$Library<-factor(pca$Library,levels=c("I1R1T0_1","I1R2T0_1","I2R1T0_1","I2R2T0_1","I1R1T3_U","I1R2T3_U","I2R1T3_U","I2R2T3_U","I1R1T3_T","I1R2T3_T","I2R1T3_T","I2R2T3_T"))


ggplot(pca,aes(PC1,PC2,col=Library))+geom_point(size=3, alpha=0.5)+
  theme(text=element_text(size=14))
  
```

We clearly see that all Baseline are similar, all untreated are similar, but each treated is very different, with a very strong outlier. Can I use custome colors and add some labels? 

```{r fig.width=5, fig.height=4}

ggplot(pca,aes(PC1,PC2,col=Library))+geom_point(size=3, alpha=0.5)+
  theme(text=element_text(size=14))+
  scale_color_manual(values=c(rep("grey",4),rep("steelblue",4),rep("tomato",4)))
  
```

## clustering and heatmap

We can use the same XX



# Other resources

I enjoy the blog from Steven Turner "Getting Genetics Done" http://www.gettinggeneticsdone.com, which provides great example of R/bioconductor applied to genomics. 

For a more classic trainign in R and bioconductor, refered to the excellent tutorial from Thomas Girk http://manuals.bioinformatics.ucr.edu/home/R_BioCondManual 

Finally, there are a number of chearsheet, that you may want to keep close to you. 
dplyr & tidyr https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf
ggplot https://www.rstudio.com/wp-content/uploads/2015/03/ggplot2-cheatsheet.pdf
colors https://www.nceas.ucsb.edu/~frazier/RSpatialGuides/colorPaletteCheatsheet.pdf

